The SARPROP Algorithm: A Simulated Annealing Enhancement to Resilient Back Propagation
Abstract
Back Propagation and its variations are widely used as methods for training artificial neural networks. One such variation, Resilient Back Propagation (RPROP), has proven to be one of the best in terms of speed of convergence. Our SARPROP enhancement, based on Simulated Annealing, is described in this paper and is shown to increase the rate of convergence for some problems. The extension involves two complementary modifications: weight constraints early in training combine with noise to force the network to perform a more thorough search of the initial weight space, before allowing the network to refine its solutions as training continues.

INTRODUCTION

There have been a number of refinements made to the BP algorithm (Tollenaere, 1990; Jacobs, 1988; Fahlman, 1988), with arguably the most successful in general being the Resilient Back Propagation method, or RPROP (Riedmiller and Braun, 1993; Riedmiller, 1994). The two major differences between BP and RPROP are that RPROP modifies the size of the weight step adaptively, and that the adaptation mechanism in RPROP does not take into account the magnitude of the gradient (∂E/∂wij) seen by a particular weight, but only the sign of the gradient (positive or negative). This allows the step size to be adapted without having the size of the gradient interfere with the adaptation process (Riedmiller, 1993). In a number of previous BP variants, the learning parameter, η, was varied adaptively (Tollenaere, 1990; Jacobs, 1988). There was, however, no account taken of the current gradient magnitude (which is combined with the learning parameter to give the step size). The size of the gradient is unpredictable, and hence it has the potential to disrupt the adaptation of the learning parameter.

The RPROP algorithm works by modifying each weight by an amount ∆wij(t), in such a way as to decrease the overall error. The magnitude of this step, ∆ij(t), is termed the update value (or learning parameter). All update values are initialised to the value ∆0. The update value is modified in the following manner: if the current gradient (∂E/∂wij(t)) multiplied by the gradient of the previous step is positive (that is, the gradient direction has remained the same), then the update value is multiplied by a factor η+ (which is greater than one). Similarly, if the gradient product is negative, the update value is multiplied by a factor η- (which is less than one). The update value remains the same if the product equals zero. This results in the update value for each weight adaptively growing, or shrinking, according to the sign of the gradient seen by that weight. Two limits are placed on the update values: a maximum ∆max and a minimum ∆min.

SARPROP

While RPROP can be extremely fast in converging to a solution, it suffers from the same problem faced by all gradient descent based methods: it can often converge to local minima. SARPROP attempts to address this problem by using the method of Simulated Annealing (SA). SA methods are a well known technique in training artificial neural networks, and have been applied to the Back Propagation algorithm (Burton and Mpitsos, 1992) with good results in terms of speed of convergence. SA, in general, involves the addition of a random noise factor during weight updates. The amount of noise added is often associated with a 'temperature' value, which decreases the effect of the noise as training progresses. The addition of noise allows the network to move in a direction which is not necessarily the direction of steepest descent.
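As a rough illustration of this annealing idea (a sketch under assumptions, not code from the paper: the constant names k3 and T and their values are placeholders), the noise term can be scaled by the current error and by a temperature factor that decays with the epoch:

    import random

    def annealed_noise(epoch, rms_error, k3=1.0, T=0.01):
        # Random perturbation whose influence fades as training progresses.
        # k3 and T are illustrative constants, not values taken from the paper.
        r = random.random()                      # r in [0, 1)
        return k3 * r * rms_error * 2 ** (-T * epoch)

Early in training the factor 2^(-T*epoch) is close to one and the noise can push a weight off its current trajectory; as the epoch count grows the factor approaches zero and the updates become purely gradient driven.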
The benefit this noise provides is that it can help the network escape from local minima. Burton and Mpitsos (1992) have also shown that SA can help increase the speed of convergence of the BP algorithm, and conclude that noise "...simply permits or facilitates greater access to such pathways that are not easily reached in the networks not containing noise..."

Two enhancements are made to the RPROP algorithm to obtain SARPROP. First, a noise factor is introduced. Noise is added to a weight's update value when the error gradient changes sign in successive epochs and the magnitude of the update value is less than a value proportional to the current RMS error. The amount of noise added is proportional to both the current RMS error value and a temperature factor based on the current epoch. Noise is added only when both conditions hold in order to minimise the disturbance to the normal adaptation of the update value: under this scheme the update value is only modified by noise when it is relatively small (indicating a number of previous gradient sign changes). This can allow the weight to jump out of a local minimum (Figure 1), while minimising the disturbance to the adaptation process. The amount of noise added decreases as training continues because of the temperature term. The noise is also influenced by the current RMS error value, a technique used by Burton and Mpitsos (1992).

The second modification involves the addition of a weight decay term to the error function. The weight decay term is also associated with a temperature factor, which results in the influence of this term decreasing as training proceeds. The new error gradient is shown below:

    ∂E/∂wij (SARPROP) = ∂E/∂wij - k1 * wij * 2^(-T*epoch)

The reason for adding a weight decay term to the error function is to constrain the weights to smaller values at the beginning of training. This permits a more complete search of the initial weight space before allowing the weight values to increase. The search is complemented by the simultaneous addition of noise to the weight updates. The intention is that by constraining the weights initially, a more promising solution can be obtained; this solution can then be expanded upon once the weight values are able to increase.

Figure 1: An example of a weight trajectory along an error surface E(wij), and the result of SARPROP's noise addition.

After these enhancements have been incorporated, the resulting SARPROP algorithm is:

    ∀i,j: ∆ij(t) = ∆0
    ∀i,j: ∂E/∂wij(t-1) = 0
    Repeat
        Compute gradient ∂E/∂w(t)
        For all weights and biases:
            if (∂E/∂wij(t-1) * ∂E/∂wij(t) > 0) then
                ∆ij(t) = minimum(∆ij(t-1) * η+, ∆max)
                ∆wij(t) = sign(∂E/∂wij(t)) * ∆ij(t)
                wij(t+1) = wij(t) + ∆wij(t)
                ∂E/∂wij(t-1) = ∂E/∂wij(t)
            else if (∂E/∂wij(t-1) * ∂E/∂wij(t) < 0) then
                if (∆ij(t-1) < k2 * error) then
                    ∆ij(t) = ∆ij(t-1) * η- + k3 * r * error * 2^(-T*epoch)
                else
                    ∆ij(t) = ∆ij(t-1) * η-
                ∆ij(t) = maximum(∆ij(t), ∆min)
                ∂E/∂wij(t-1) = 0
            else if (∂E/∂wij(t-1) * ∂E/∂wij(t) = 0) then
                ∆wij(t) = sign(∂E/∂wij(t)) * ∆ij(t)
                wij(t+1) = wij(t) + ∆wij(t)
                ∂E/∂wij(t-1) = ∂E/∂wij(t)
    Until (converged)

In the algorithm, error corresponds to the (normalised) RMS error, T corresponds to the temperature, k1, k2, and k3 are constants, and r is a random number between 0 and 1.
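To make the update rule concrete, the pseudocode can be transcribed into a short Python sketch for a single weight. This is an illustrative transcription rather than the authors' implementation: the constants eta_plus, eta_minus, delta_max, delta_min, k2, k3, and T are placeholders, the weight step follows the sign convention of the pseudocode as printed, and grad is assumed to be the SARPROP-modified gradient that already includes the weight decay term given above.

    import random

    def sign(x):
        return 1.0 if x > 0 else -1.0 if x < 0 else 0.0

    def sarprop_update(w, grad, prev_grad, delta, epoch, rms_error,
                       eta_plus=1.2, eta_minus=0.5,
                       delta_max=50.0, delta_min=1e-6,
                       k2=0.1, k3=1.0, T=0.01):
        # One SARPROP update for a single weight (illustrative constants only).
        temp = 2 ** (-T * epoch)                     # temperature factor
        if prev_grad * grad > 0:                     # gradient sign unchanged: grow the step
            delta = min(delta * eta_plus, delta_max)
            w = w + sign(grad) * delta               # weight step as in the pseudocode
            prev_grad = grad
        elif prev_grad * grad < 0:                   # gradient sign changed: shrink the step
            if delta < k2 * rms_error:               # small update value: inject annealed noise
                delta = delta * eta_minus + k3 * random.random() * rms_error * temp
            else:
                delta = delta * eta_minus
            delta = max(delta, delta_min)
            prev_grad = 0.0                          # no weight step; forget this gradient
        else:                                        # gradient product is zero
            w = w + sign(grad) * delta
            prev_grad = grad
        return w, delta, prev_grad

In a full implementation this update would be applied to every weight and bias once per epoch, with the gradient computed over the training set as in the pseudocode's outer loop.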